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Abstract 

Recently, the Frank-Wolfe optimization algo¬ 
rithm was suggested as a procedure to ob¬ 
tain adaptive quadrature rules for integrals 
of functions in a reproducing kernel Hilbert 
space (RKHS) with a potentially faster rate 
of convergence than Monte Carlo integration 
(and “kernel herding” was shown to be a spe¬ 
cial case of this procedure). In this paper, 
we propose to replace the random sampling 
step in a particle filter by Frank-Wolfe op¬ 
timization. By optimizing the position of 
the particles, we can obtain better accuracy 
than random or quasi-Monte Carlo sampling. 

In applications where the evaluation of the 
emission probabilities is expensive (such as in 
robot localization), the additional computa¬ 
tional cost to generate the particles through 
optimization can be justified. Experiments 
on standard synthetic examples as well as on 
a robot localization task indicate indeed an 
improvement of accuracy over random and 
quasi-Monte Carlo sampling. 

1 Introduction 

In this paper, we explore a way to combine ideas from 
optimization with sampling to get better approxima¬ 
tions in probabilistic models. We consider state-space 
models (SSMs, also referred to as general state-space 
hidden Markov models), as they constitute an impor¬ 
tant class of models in engineering, econometrics and 
other areas involving time series and dynamical sys¬ 
tems. A discrete-time, nonlinear SSM can be written 
as 

x t |xi;( t _i) ~ p{x t \x t -i)\ y t \xv.t ~p{y t \x t ), (1) 
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where Xt € X denotes the latent state variable and 
y t € y the observation at time t. Exact state infer¬ 
ence in SSMs is possible, essentially, only when the 
model is linear and Gaussian or when the state-space 
A is a finite set. For solving the inference problem be¬ 
yond these restricted model classes, sequential Monte 
Carlo methods, i.e. particle filters (PFs), have emerged 
as a key tool; see e.g., Doucet and Johansen (2011); 
Cappe et al. (2005); Doucet et al. (2000). However, 
since these methods are based on Monte Carlo integra¬ 
tion they are inherently affected by sampling variance, 
which can degrade the performance of the estimators. 

Particular challenges arise in the case when the ob¬ 
servation likelihood p(yt\xt ) is computationally expen¬ 
sive to evaluate. For instance, this is common in 
robotics applications where the observation model re¬ 
lates the sensory input of the robot, which can com¬ 
prise vision-based systems, laser rangefinders, syn¬ 
thetic aperture radars, etc. For such systems, simply 
evaluating the observation function for a fixed value 
of x t can therefore involve computationally expensive 
operations, such as image processing, point-set regis¬ 
tration, and related tasks. This poses difficulties for 
particle-filtering-based solutions for two reasons: (1) 
the computational bottleneck arising from the like¬ 
lihood evaluation implies that we cannot simply in¬ 
crease the number of particles to improve the accuracy, 
and (2) this type of “complicated” observation models 
will typically not allow for adaptation of the proposal 
distribution used within the filter, in the spirit of Pitt 
and Shephard (1999), leaving us with the standard— 
but inefficient —bootstrap proposal as the only viable 
option. On the contrary, for these systems, the dy¬ 
namical model p(xt\xt~i) is often comparatively sim¬ 
ple, e.g. being a linear and Gaussian “nearly constant 
acceleration” model (Ristic et al., 2004). 

The method developed in this paper is geared toward 
this class of filtering problems. The basic idea is that, 
in scenarios when the likelihood evaluation is the com¬ 
putational bottleneck, we can afford to spend addi¬ 
tional computations to improve upon the sampling of 
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the particles. By doing so, we can avoid excessive vari¬ 
ance arising from simple Monte Carlo sampling from 
the bootstrap proposal. 

Contributions. We build on the optimization view 
from Bach et al. (2012) of kernel herding (Chen et al., 
2010) to approximate the integrals appearing in the 
Bayesian filtering recursions. We make use of the 
Frank-Wolfe (FW) quadrature to approximate, in par¬ 
ticular, mixtures of Gaussians which often arise in a 
particle filtering context as the mixture over past parti¬ 
cles in the distribution over the next state. We use this 
approach within a filtering framework and prove the¬ 
oretical convergence results for the resulting method, 
denoted as sequential kernel herding (SKH), giving one 
of the first explicit better convergence rates than for a 
particle filter. Our preliminary experiments show that 
SKH can give better accuracy than a standard particle 
filter or a quasi-Monte Carlo particle filter. 

2 Adaptive quadrature rules with 
Frank-Wolfe optimization 


nijp) := E p [$] £ T~L (Smola et al., 2007; Sriperumbudur 
et al., 2010). Essentially, by using Cauchy-Schwartz 
inequality and the linearity of the expectation opera¬ 
tor, we can obtain: 


sup \E p [f] -Ep[/]| 
/ 6 « 
ll/llw<l 


IIMp)-Mp)II« 

=: MMD(p,p), 


(3) 


and so by bounding MMD(p,p), we can bound the er¬ 
ror of approximating the expectation for all / £ "H, 
with ||/||% as a proportionality constant. MMD(p/) 
is thus a central quantity for developing good quadra¬ 
ture rules given by (2). In the context of RKHSs, 
MMD(p, q) can be called the maximum mean discrep¬ 
ancy (Gretton et al., 2012) between the distributions 
p and q , and acts a pseudo-metric on the space of dis¬ 
tributions on X. If re is a characteristic kernel (such 
as the standard RBF kernel), then MMD is in fact a 
metric, i.e. MMD(p, q) = 0 => p = q. We refer the 
reader to Sriperumbudur et al. (2010) for the regular¬ 
ity conditions needed for the existence of these objects 
and for more details. 


2.1 Approximating the mean element for 
integration in a RKHS 

We consider the problem of approximating integrals 
of functions belonging to a reproducing kernel Hilbert 
space (RKHS) T~L with respect to a fixed distribution p 
over some set X. We can think of the elements of 
H as being real-valued functions on X , with point- 
wise evaluation given from the reproducing property 
by f(x) = (/, d>(j:)), where $ : A — H is the fea¬ 
ture map from the state-space X to the RKHS. Let 
re : X 2 —» R. be the associated positive definite ker¬ 
nel. We briefly review here the setup from Bach et al. 
(2012), which generalized the one from Chen et al. 
(2010). We want to approximate integrals E p [f] for 
/ £ T~L using a set of n points x^\ ..., a/") g X asso¬ 
ciated with positive weights w^\ ..., w' n ^ which sum 
to 1: 

n 

Ep[/]«E u;W /( a;(i) ) = E p[/]’ (2) 

i=l 

where p := i) is the associated empirical 

distribution defined by these points and S x (-) is a point 
mass distribution at x. If the points a/ 1 ) are inde¬ 
pendent samples from p, then this Monte Carlo esti¬ 
mate (using weights of 1 /n) is unbiased with a vari¬ 
ance of Y p [f]/n, where V p [/] is the variance of / with 
respect to p. By using the fact that / belongs to the 
RKHS T~L, we can actually choose a better set of points 
with lower error. It turns out that the worst-case error 
of estimators of the form (2) can be analyzed in terms 
of their approximation distance to the mean element 


2.2 FVank-Wolfe optimization for adaptive 
quadrature 

For getting a good quadrature rule p, our goal is thus 
to minimize || p{p) — p(p)\\n- We note that p(p) lies in 
the marginal polytope A i G 71, defined as the closure of 
the convex-hull of ${X). We suppose that $(2;) is uni¬ 
formly bounded in the feature space, that is, there is a 
finite R such that ||$(x)||% < R Vx £ X. This means 
that At is a closed bounded convex subset of %, and we 
could in theory optimize over it. This insight was used 
by Bach et al. (2012) who considered using the Frank- 
Wolfe optimization algorithm to optimize the convex 
function J(g ) := \\\g — /r(p)||% over At to obtain 
adaptive quadrature rules. The Frank-Wolfe algorithm 
(also called conditional gradient) (Frank and Wolfe, 
1956) is a simple first-order iterative constrained op¬ 
timization algorithm for optimizing smooth functions 
over closed bounded convex sets like At (see Dunn 
(1980) for its convergence analysis on general infinite 
dimensional Banach spaces). At every iteration, the 
algorithm finds a good feasible search vertex of At by 
minimizing the linearization of J at the current iter¬ 
ate g k : g k +i = argmin geM (J'{g k ), g)■ The next iter¬ 
ate is then obtained by a suitable convex combination 
of the search vertex g k +i and the previous iterate g k : 
g k+ i = (1 - 7 k )g k + 7fc5fc+i for a suitable step-size y fc 
from a fixed schedule (e.g. l/(k + 1)) or by using line- 
search. A crucial property of this algorithm is that the 
iterate g k is thus a convex combination of the vertices 
of At visited so far. This provides a sparse expan¬ 
sion for the iterate, and makes the algorithm suitable 
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to high-dimensional optimization (or even infinite) - 
this explains in part the regain of interest in machine 
learning in the last decade for this old optimization al¬ 
gorithm (see Jaggi (2013) for a recent survey). In our 
setup where Ad is the convex hull of < 1 >(A’), the vertices 
of Ad are thus of the form gu+i = d>(a/ fc+1 )) for some 
x ( k + 1 ) £ x. Running Frank-Wolfe on Ad thus yields 
9k = X!i=i w k ^( x ^) == Ep[*&] for some weighted set 
of points {w^ The iterate gk thus corre¬ 

sponds to a quadrature rule p of the form of ( 2 ) and 
gk = Ep[d>], and this is the relationship that was ex¬ 
plored in Bach et al. (2012). Running Frank-Wolfe op¬ 
timization with the step-size of 7 *, = l/(fc + 1 ) reduces 
to the kernel herding algorithm proposed by Chen 
et al. (2010). See also Huszar and Duvenaud (2012) 
for an alternative approach with negative weights. 

Algorithm 1 presents the Frank-Wolfe optimization al¬ 
gorithm to solve min ge ^i J(g) in the context of get¬ 
ting quadrature rules (we also introduce the short¬ 
hand notation p p := p(p)). We note that to evalu¬ 
ate the quality MMD(p,p) of this adaptive quadra¬ 
ture rule, we need to be able to evaluate p p (x) = 
J x , eX p( x ')k(x', x)dx' efficiently. This is true only for 
specific pairs of kernels and distributions, but fortu¬ 
nately this is the case when p is a mixture of Gaussians 
and k is a Gaussian kernel. This insight is central to 
this paper; we explore this case more specifically in 
Section 2.3. To find the next quadrature point, we 
also need to (approximately) optimize p p (x) over X 
(step 3 of Algorithm 1 , called the FW vertex search). 
In general, this will yield a non-convex optimization 
problem, and thus cannot be solved with guarantees, 
even with gradient descent. In our current implemen¬ 
tation, we approach step 3 by doing an exhaustive 
search over M random samples from p precomputed 
when FW-Quad is called. We thus follow the idea 
from the kernel herding paper (Chen et al., 2010) to 
choose the best N “super-samples” out of a large set 
of samples M. Thanks to the fact that convergence 
guarantees for Frank-Wolfe optimization can still be 
given when using an approximate FW vertex search, 
we show in Appendix B of the supplementary material 
that this procedure either adds a 0 ( 1 /Af 1 / 4 ) term or 
a 0(1 /VM) term to the worst-case MMD(ji,p) error. 

In our description of Algorithm 1, a preset number N 
of particles (iterations) was used. Alternatively, we 
could use a variable number of iterations with the ter¬ 
minating criterion test ||gk — y(p)\\u < e which can 
be explicitly computed during the algorithm and pro¬ 
vides the MMD error bound on the returned quadra¬ 
ture rule. Option (2) on line 5 chooses the step-size 
7 ^ by analytic line-search (hereafter referred as the 
FW-LS version) while option (1) chooses the kernel 
herding step-size 7 *, = l/(fc + 1 ) (herafter referred as 


the FW version) which always yields uniform weights: 
w k' > = 1/& f° r all i < k. A third alternative is 
to re-optimize J(g) over the convex hull of the pre¬ 
viously visited vertices; this is called the fully cor¬ 
rective version (Jaggi, 2013) of the Frank-Wolfe al¬ 
gorithm (hereafter referred as FCFW). In this case: 
r (!) (fc+i)\ • t is 

«+!.-• •>'<+! ) = axgmm w&Ah+i w'K k+1 w - 
2 cj +1 w, where A*. +1 is the (k + l)-dimensional prob¬ 
ability simplex, Kk +1 is the kernel matrix on the 
(ft + 1) vertices: (Kk+i)ij = k{x^\x^) and {ck+i)i = 
p p (x^) for i = 1, ...,(& +1). This is a convex 
quadratic problem over the simplex. A slightly modi¬ 
fied version of the FCFW is called the min-norm point 
algorithm and can be more efficiently optimized us¬ 
ing specific purpose active-set algorithms — see Bach 
(2013, §9.2) for more details. We refer the reader 
to Bach et al. (2012) for more details on the rate of con¬ 
vergence of Frank-Wolfe quadrature assuming that the 
FW vertex is found with guarantees. We summarize 
them as follows: if H is infinite dimensional, then FW- 
Quad gives the same 0(l/y/~N) rate for the MMD error 
as standard random sampling, for all FW methods. 
On the other hand, if a ball of non-zero radius cen¬ 
tered at p, p lies within Ad, then faster rates than ran¬ 
dom sampling are possible: FW gives a 0(1/IV) rate 
whereas FW-LS and FCFW gives exponential conver¬ 
gence rates (though in practice, we often see differences 
not explained by the theory between these methods). 

2.3 Example: mixture of Gaussians 

We describe here in more details the Frank-Wolfe 
quadrature when p is a mixture of Gaussians p(x) = 
7r*A/"(a;|Atj, E») for A = and k is the Gaus¬ 
sian kernel K a (x,x') := exp(— — x'\\ 2 ). In this 

case, p, p (x) = J2f =1 ^i(V^o-) d Af(x\p,i,T,i + a 2 Id)- We 
thus need to optimize a difference of mixture of Gaus¬ 
sian bumps in step 3 of Algorithm 1, a non-convex 
optimization problem that we approximately solve by 
exhaustive search over M random samples from p. 

3 Sequential kernel herding 

3.1 Sequential Monte Carlo 

Consider again the SSM in (1). The joint probabil¬ 
ity density function for a sequence of latent states 
X\-t '■= (aq ,...,Xt) and observations y-\ factor¬ 
izes as p(x\ : Ti Vi-.t) = Ylt=iP( x t\ x t-i)p(yt\xt), with 
p{xi\xq) := P( x 1 ) denoting the prior density on the 
initial state. We would like to do approximate in¬ 
ference in this SSM. In particular, we could be in¬ 
terested in computing the joint filtering distribution 
rt(xi-.t) := p(xi : t\yi:t) or the joint predictive distri¬ 
bution p t+1 (x t +i, Xi:t) := p{x t +i,Xi :t \yi:t). In parti- 



Sequential Kernel Herding: Frank-Wolfe Optimization for Particle Filtering 


Algorithm 1 FW-Quad(p, 'LL, N ): Frank-Wolfe adap¬ 
tive quadrature 

Input: distribution p, RKHS 'LL which defines ker¬ 
nel k(-, •) and state-space X , number of samples N 
1: Let go = 0. 

2: for k = 0... N — 1 do 

3: Solve x^ k+1 ^ = argmin(< 7 fc — / i p , $(x)) 

That is: 

k 

x( fe+ i) = argminV^ (k(x^,x) — p p (x)). 

4: Option (1): Let 7fc = ^r^;. 

5: Option (2): Let 7fc = <9fc |^’$(^lo)p ))> ( LS ) 

6: Update g fe+ i = (1 - 'y k )gk + 7fe^(^ (fc+1) ) 

(fc+i) 

i-e. w K k+1 = 7fcj 

and = (1 — 7 k) w k ^ for i = 1... k 

7: end for 

8: Return: P=Eil 1 w v4w 


cle filtering methods, we approximate these distribu¬ 
tions with empirical distributions from weighted parti¬ 
cle sets {w^\ as in (2). We note that it is easy 

to marginalize p with a simple weight summation, and 
so we will present the algorithm as getting an approx¬ 
imation for the joint distributions r t and pt defined 
above, with the understanding that the marginal ones 
are easy to obtain afterwards. In the terminology of 
particle filtering, x is the particle at time t, whereas 
x\. t is the particle trajectory. While principally the PF 
provides an approximation of the full joint distribution 
rt{x\..t)i it is well known that this approximation dete¬ 
riorates for any marginal of x s for sCi (Doucet and 
Johansen, 2011). Hence, the PF is typically only used 
to approximate marginals of x s for s < t (fixed-lag 
smoothing) or s = t (filtering), or for prediction. 

Algorithm 2 presents the bootstrap particle filtering al¬ 
gorithm (Gordon et al., 1993) from the point of view of 
propagating an approximate posterior distribution for¬ 
ward in time (see e.g. Fearnhead, 2005). We describe 
it as propagating an approximation pt{x+t) of the joint 
predictive distribution one time step forward with the 
model dynamics to obtain p t+1 (x t+1 , X\ :t ) (step 5), 
and then randomly sampling from it (step 3) to get 
the new predictive approximation pt+i(xt+i, x± : t) ■ As 
Pt is an empirical distribution, pt+i is a mixture distri¬ 
bution (the mixture components are coming from the 
particles at time t): 

Pt+i(x t+ i,xi :t ) = 

I N 

_£p(rfK« p{x t+1 \x { f ) ) 8 (,i){x 1:t ). (4) 

mixture weight mixture component 


Algorithm 2 Particle filter template (joint predictive 
distribution form) — SKH alg. by changing step 3 

Input: SSM p(x t \xt-i), 

o t (x t ) := p(yt\x t ) for t G 1 : T. 

Maintain pt{x l:t ) = during algo¬ 

rithm as approximation of p{x t , x 1 .( t _ 1 ) |yi ; (t-i))- 
1: Let Pi(xi) :=p(x i) 

2: for t=l ..., T do 

3: Sample: get p t = SAMPLE(p t , N) 

[For SKH, use p t = FW-Quad(p t , Lit, A)] 

4: Include observation and normalize: 

W t = Ep t [o t \ ; r t (xi, t ) := ^-o t (x t )p t (xi :t ). 

5: Propagate approximation forward: 

Pt+i(xt+i,xi :t ) := p(x t +i\xt)f t (xi:t) 

6: end for 

7: Return Filtering distribution fr; predictive 
distribution Pt+i] normalization constants 

W 1 ,...,W T . 


We denote the conditional normalization constant at 
time t by Wt := p(yt\yi:(t-i)) and the global normal¬ 
ization constant by Z t := p(y 1:t ) = nL=i W u- W t 
is the particle filter approximation to Wt and is ob¬ 
tained by summing the un-normalized mixture weights 
in (4); see step 4 in Algorithm 2. Randomly sam¬ 
pling from (4) is equivalent to first sampling a mix¬ 
ture component according to the mixture weight (i.e., 
choosing a past particle x\ l ) t to propagate), and then 
sampling its next extension state xJG with probability 

p(x t +i\x[ l ' > ). The standard bootstrap particle filter is 
thus obtained by maintaining uniform weight for the 
predictive distribution = j^) and randomly sam¬ 
pling from (4) to obtain the particles at time f+1. This 
gives an unbiased estimate of pt+i : Ep t+1 [pt+i] = Pt+i ■ 
Lower variance estimators can be obtained by using a 
different resampling mechanism for the particles than 
this multinomial sampling scheme, such as stratified 
resampling (Carpenter et al., 1999) and are usually 
used in practice instead. 

One way to improve the particle filter is thus to re¬ 
place the random sampling stage of step 3 with differ¬ 
ent sampling mechanisms with lower variance or bet¬ 
ter approximation properties of the distribution p t +1 
that we are trying to approximate. As we obtain the 
normalization constants Wt by integrating the obser¬ 
vation probability, it seems natural to look for particle 
point sets with better integration properties. By re¬ 
placing random sampling with a quasi-random number 
sequence, we obtain the already proposed sequential 
quasi-Monte Carlo scheme (Philomin et al., 2000; Or- 
moneit et al., 2001; Gerber and Chopin, 2014). The 
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main contribution of our work is to instead propose to 
use Frank-Wolfe quadrature in step 3 of the particle 
filter to obtain better (adapted) point sets. 

3.2 Sequential kernel herding 


X t +\ defined (depending on T-L t + 1 ) as all functions for 
which the following semi-norm is finite: 1 


ll/lk ■= sup 

IW|-K t+ l =1 



f(x t+ i)h(x t+ i)dxt + i 


In the sequential kernel herding (SKH) algorithm, 
we simply replace step 3 of Algorithm 2 with p t = 
FW-Quad(p t ,% t , N). As mentioned in the introduc¬ 
tion, many dynamical models used in practice assume 
Gaussian transitions. Therefore, we will put par¬ 
ticular emphasis on the case when (more generally) 
p(xt\xi : ( t -i), 2 /i:(t_i)) is a mixture of Gaussians, with 
parameters for the mixture components that can be ar¬ 
bitrary functions of the state history X\ : i t -\),yi-.(t-\), 
and is thus still fairly general. We thus consider the 
Gaussian kernel for the FW-Quad procedure as then 
we can compute the required quantities analytically. 
An important subtle point is which Hilbert space Tit 
to consider. In this paper, we focus on the marginal¬ 
ized filtering case, i.e. we are interested in p(x t \yi-.t) 
only. Thus we are only interested in functions of Xt, 
which is why we define our kernel at time t to only 
depend on Xt and not the past histories. For simplic¬ 
ity, we also assume that Tit = Tl for all t (we use the 
same kernel for each time step). Even though the algo¬ 
rithm can maintain the distribution on the whole his¬ 
tory pt(xi;t), the past histories are marginal¬ 

ized out when computing the mean map, for example 
p(p t ) = Ej 5 t(a;i;t )[$(x t )]. During the SKH algorithm, 
we can still track the particle histories by keeping track 
from which mixture component in (4) x t was coming 
from, but the past history is not used in the compu¬ 
tation of the kernel and thus does not appear as a 
repulsion term in step 3 of Algorithm 1. We leave it as 
future work to analyze what kind of high-dimensional 
kernel on past histories would make sense in this con¬ 
text, and to analyze its convergence properties. The 
particle histories are useful in the Rao-Blackwellized 
extension that we present in Appendix A and use in 
the robot localization experiment of Section 4.3. 

3.3 Convergence theory 

In this section, we give sufficient conditions to guar¬ 
antee that SKH is consistent as N goes to infinity. 
Let pt here denote the marginalized predictive in¬ 
stead of the joint. Let F t be the forward transfor¬ 
mation operator on signed measures that takes the 
predictive distribution p t on Xt and yields the un¬ 
normalized marginalized predictive distribution F t pt 
on Xt+ 1 in the SSM. Thus for a measure v, we get 
(F t v)(■) ■— f Xt p(-\xt)p(yt\xt)dv(x t ). We also have 
that pt+i = w;FtPt- 

For the following theorem, Ft is a function space on 


Theorem 1 (Bounded growth of the mean map). 
Suppose that the function ft : {xt+ i,Xt) 
p{yt\xt)p{xt+i\xt) is in the tensor product function 
space F t (S)Tlt with the following defined nuclear norm: 
\\ft\\r t ®'H t ■= inf Yji IMbJIAIIwt, where the infimum 
is taken over all the possible expansions such that 
ft(x t+ i,x t ) = J2i a i( x t+i)/3i{xt) for allx t ,x t +i. Then 
for any finite signed Borel measure v on X t , we have: 

\\v(Ftv)\\-H t+1 < IIm(t)I \u t - 

Theorem 2 (Consistency of SKH). Suppose that for 
all 1 < t < T, f t is in F t <8>Tlt as defined in Theorem 1 
and Ot is in Tit ■ Then we have: 2 


IIMpt) - t(pt)\\h t < 


£T + R 


IgT-llmr-i 

Ht-i 


T-l /T—2 \ 

+ PT- 1 ) II P k 

t—1 \ k—t / 


where p t := Xt ■- J]^ 1 a nd i t is the 

FW error reported at time t by the algorithm: it := 
II P(Pt) - P{Pt)\\Uf 


We note that \t « 1 as we expect the errors on Wk 
to go in either direction, and thus to cancel each other 
over time (though in the worst case it could grow expo¬ 
nentially in t). If it < e and pt < p , we basically have 
II p(pt) - p(pt) || = 0(p T e) if p > 1; 0(Te) if p = 1; 
and O(e) if p < 1 (a contraction). The exponential 
dependence in T is similar as for a standard particle 
filter for general distributions; see Douc et al. (2014) 
though for conditions to get a contraction for the PF. 

Importantly, for a fixed T it follows that the rates of 
convergence for Frank-Wolfe in N translates to rates 
of errors for integrals of functions in Tl with respect 
to the predictive distribution pt■ Thus if we suppose 
that Tl is finite dimensional, that pt has full support 
on X for all t and that the kernel k is continuous, then 
by Proposition 1 in Bach et al. (2012), we have that 
the faster rates for Frank-Wolfe hold and in particu¬ 
lar we could obtain an error bound of 0(1/Af) with N 
particles. As far as we know, this is the first explicit 
faster rates of convergence as a function of the number 

Hn general, the integral on Xt+i should be with re¬ 
spect to the base measure for which the conditional density 
p(xt+i\xt) is defined. All proofs are in the supplementary 
material. 

2 We use the convention that the empty sum is 0 and 
the empty product is 1. 
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d = 2, K = 100, o 2 = 1 




Figure 1: Top: MMD error for different sampling 
schemes where p is a mixture of 2d Gaussians with 
K = 100 components. Bottom: error on the mean 
estimate for the same mixture. The dashed lines are 
linear fits with slopes reported next to the axes. 

of particles than the standard 0(-T=) for Monte Carlo 
particle filters. In contrast, Gerber and Chopin (2014, 
Theorem 7) showed a o{-^=) rate for the randomized 
version of their SQMC algorithm (note the little-o). 3 
Note that the theorem does not depend on how the 
error of e is obtained on the mean maps of the distri¬ 
bution; and so if one could show that a QMC point 
set could also achieve a faster rate for the error on the 
mean maps (rather than on the distributions itself as 
is usually given), then their rates would translate also 
to the global rate by Theorem 2. 4 

4 Experiments 

4.1 Sampling from a mixture of Gaussians 

We start by investigating the merits of different sam¬ 
pling schemes for approximating mixtures of Gaus¬ 
sians, since this is an intrinsic step to the SKH al¬ 

3 The rate holds on the approximation of integrals of 
continuous bounded functions. 

4 We also note that a simple computation shows that 
for a Monte Carlo sample of size N, E||/r(p) — p(p )||^ < 

(f ? 2 —||m(p)|| 2 ) 

N 


gorithm. In Figure 1, we give the MMD error as 
well as the error on the mean function in term of 
the number of particles N for the different sampling 
schemes on a randomly chosen mixture of Gaussians 
with I\ = 100 components in d = 2 dimensions. Addi¬ 
tional results as well as the details of the model are 
given in Appendix C.l of the supplementary mate¬ 
rial. In our experiments, the number of FW search 
points is M = 50,000. We note that even though in 
theory all methods should have the same rate of con¬ 
vergence 0(1/\fN) for the MMD ( as T~L is infinite di¬ 
mensional), FCFW empirically improves significantly 
over the other methods. As d increases, the differ¬ 
ence between the methods tapers off for a fixed kernel 
bandwidth a 2 , but increasing a 2 gives better results 
for FW and FCFW than the other schemes. 

In the remaining sections, we evaluate empirically the 
application of kernel herding in a filtering context us¬ 
ing the proposed SKH algorithm. 

4.2 Particle filtering using SKH on synthetic 
examples 

We consider first several synthetic data sets in order 
to assess the improvements offered by Frank-Wolfe 
quadrature over standard Monte Carlo and quasi- 
Monte-Carlo techniques. We generate data from four 
different systems (further details on the experimental 
setup can be found in Appendix C.2): 

Two linear Gaussian state-space (LGSS) mod¬ 
els of dimensions d = 3 and d = 15, respectively. 

A jump Markov linear system (JMLS), consist¬ 
ing of 2 interacting LGSS models of dimension 
d = 2. The switching between the models is gov¬ 
erned by a hidden 2-state Markov chain. 

A nonlinear benchmark time-series model used by, 
among others, Doucet et al. (2000); Gordon et al. 
(1993). The model is of dimension d = 1 and is 
given by: 

x t +i = 0.5a; t + 25 Xt 2 + 8cos(1.2f) + v t , 

1 + x t 

yt = 0.05x 2 + e t , 

with vt and et mutually independent standard 
Gaussian. 

These models are ordered in increasing levels of diffi¬ 
culty for inference. For the LGSS models, the exact 
filtering distributions can be computed by a Kalman 
filter. For the JMLS, this is also possible by running 
a mixture of Kalman filters, albeit at a computational 
cost of 2 t (where T is the total number of time steps). 
For the nonlinear system, no closed form expressions 
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are available for the filtering densities; instead we run 
a PF with N = 100,000 particles as a reference. 

We generate 30 batches of observations for T = 100 
time steps from all systems, except for the JMLS where 
we use T = 10 (to allow exact filtering). We run the 
proposed SKH filter, using both FW and FCFW op¬ 
timization and compare against a bootstrap PF (us¬ 
ing stratified resampling (Carpenter et ah, 1999)) and 
a quasi-Monte-Carlo PF based on a Sobol-sequence 
point-set. All methods are run with N varying from 
20 to 200 particles. We deliberately use rather few par¬ 
ticles since, as discussed above, we believe that this is 
the setting when the proposed method can be partic¬ 
ularly useful. 

To assess the performances of the different meth¬ 
ods, we first compute the root-mean-squared errors 
(RMSE) for the filtered mean-state-estimates over 
the T time steps, w.r.t. the reference filters. We re¬ 
port the median RMSEs over the 30 different data 
batches, along with the 25% and 75% quantiles, and 
the minimum and maximum values in Figure 2. The 
SKF1 algorithms were run for three different values of 
a 2 £ {0.01,0.1,1}. Here, we report the results for 
a 2 = 1 for the LGSS models and the JMLS, and for 
a 2 = 0.1 for the nonlinear benchmark model. The re¬ 
sults for the other values are given in Appendix C.2. 
The improvements are somewhat robust to the value 
of o' 2 , but in some cases significant differences were ob¬ 
served. As can be seen, both SKH methods improve 
significantly upon both QMC and the bootstrap PF. 
For the two LGSS models, we also compute the MMD 
(reported in the rightmost column in Figure 2). 

4.3 Vision-based UAV Localization 

In this section, we apply the proposed SKH algo¬ 
rithm to solve a filtering problem in field robotics. 
We use the data and the experimental setup described 
by Tornqvist et al. (2009). The problem consists of es¬ 
timating the full six-dimensional pose of an unmanned 
aerial vehicle (UAV). 

Tornqvist et al. (2009) proposed a vision-based solu¬ 
tion, essentially tracking interest points in the camera 
images over consecutive frames to estimate the ego- 
motion. This information is then fused with the in¬ 
ertial and barometer sensors to estimate the pose of 
the UAV. The system is modelled on state-space form, 
with a state vector comprising the position, velocity, 
acceleration, as well as the orientation and the angu¬ 
lar velocity of the UAV. The state is also augmented 
with sensor biases, resulting in a state dimension of 22. 
Furthermore, the state is augmented with the three- 
dimensional positions of the interest points that are 
currently tracked by the vision system; this is a vary¬ 


ing number but typically around ten. 

To deal with the high-dimensional state-vector, 
Tornqvist et al. (2009) used a Rao-Blackwellized PF 
(see Appendix A) to solve the filtering problem, 
marginalizing all but 6 state components (being the 
pose, i.e., the position and orientation) using a combi¬ 
nation of Kalman filters and extended Kalman filters. 
The remaining 6 state-variables were tracked using a 
bootstrap particle filter with N = 200 particles; the 
strikingly small number of particles owing to the com¬ 
putational complexity of the likelihood evaluation. 

For the current experiment, we obtained the code and 
the flight-test data from Tornqvist et al. (2009). The 
modularity of our approach allowed us to simply re¬ 
place the Monte Carlo simulation step within their 
setup with FW-Quad. We ran SKH-FW with er 2 = 10 
and SKH-FCFW with <r 2 = 0.1, as well as the boot¬ 
strap PF used in Tornqvist et al. (2009), and a QMC- 
PF; all methods using N = 50, 100, and 200 parti¬ 
cles. We ran all methods 10 times on the same data; 
the variation in SKH coming from the random search 
points for the FW procedure, and in QMC for starting 
the Sobol sequence at different points. For compari¬ 
son, we ran 10 times a reference PF with N = 100,000 
particles and averaged the results. The median posi¬ 
tion errors for 100 seconds of robot time (there are 20 
SSM time steps per second of robot time) are given in 
Figure 3. The UAV is assumed to start at a known 
location at time zero, hence, all the errors are zero ini¬ 
tially. Note that all methods accumulate errors over 
time. This is natural, since there is no absolute po¬ 
sition reference available (i.e., the filter is unstable) 
and the objective is basically to keep the error as 
small as possible for as long time as possible. SKH- 
FW here gives the overall best results, with significant 
improvements over the bootstrap PF and the QMC 
methods for small number of particles. SKH-FW even 
gives similar errors for the last time step with only 
N = 200 particles as one of the reference PFs (us¬ 
ing N = 100,000 particles). See Appendix C.2.1 for a 
discussion of the role of a 2 for FCFW. 

Runtimes. In these experiments, we focused on in¬ 
vestigating how optimization could improve the error 
per particle, as the gain in runtime depends on the 
exact implementation as well as the likelihood eval¬ 
uation cost. We note that the FW-Quad algorithm 
scales as 0{NM) for N samples and M search points 
when using FW, by updating the objective on the M 
search points in an online fashion (we also empirically 
observed this linear scaling in N). On the other hand, 
FCFW scales as 0(N 2 M) as the weights on the parti¬ 
cles possibly change at each iteration, preventing the 
same online trick. SKH scales linearly with the num¬ 
ber of time steps T (as a standard PF). For the UAV 
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Figure 2: RMSEs (left and middle columns) for the four considered models and MMDs (right column) for the 
two LGSS models. 


JV = 50 JV = 100 




UAV - last time step error 



Figure 3: Median of position errors over 10 runs for each method. The errors are computed relative to the mean 
prediction over 10 runs of a PF with 100k particles (the variation of the reference PF is also shown for PF 100k). 
The error bars represent the [25%, 759b] quantile. The rightmost plot shows the error at the last time step as a 
function of N. 100 s of robot time represents 2,000 SSM time steps, it does not correspond to computation time. 


application, the original Matlab code from Tornqvist 
et al. (2009) spent an average of 0.2 s per time step for 
N = 50 particles (linear in the number of particles as 
the likelihood evaluation is the bottleneck) on a XEON 
E5-2620 2.10 GHz PC. The overhead of using our Mat- 
lab implementation of FW-Quad with N = 50 is about 
0.15 s per time step for FW and 0.3 s for FCFW; and 
0.3 s for FW and 1.0 s for FCFW for N = 100 (we 
used M = 10,000 search points in this experiment). 
In practice, this means that SKH-FW can be run here 
with 50 particles in the same time as the standard PF 
is run with about 90 particles. But as Figure 3 shows, 
the error for SKH-FW with 50 particles is still much 
lower than the PF with 200 particles. 


5 Conclusion 

We have developed a method for Bayesian filtering 
problems using a combination of optimization and par¬ 
ticle filtering. The method has been demonstrated to 
provide improved performance over both random sam¬ 
pling and quasi-Monte Carlo methods. The proposed 
method is modular and it can be used with different 
types of particle filtering techniques, such as the Rao- 
Blackwellized particle filter. Further investigating this 
possibility for other classes of particle filters is a topic 
for future work. Future work also includes a deeper 
analysis of the convergence theory for the method in 
order to develop practical guidelines for the choice of 
the kernel bandwidth. 

























































































Simon Lacoste-Julien, Fredrik Lindsten, Francis Bach 


Acknowledgements 

We thank Eric Moulines for useful discussions. This 
work was partially supported by the MSR-Inria Joint 
Centre, a grant by the European Research Council 
(SIERRA project 239993) and by the Swedish Re¬ 
search Council (project Learning of complex dynamical 
systems number 637-2014-466). 

References 

F. Bach. Learning with submodular functions: A 
convex optimization perspective. Foundations and 
Trends in Machine Learning, 6(2-3):145-373, 2013. 

F. Bach, S. Lacoste-Julien, and G. Obozinski. On the 
equivalence between herding and conditional gradi¬ 
ent algorithms. In Proceedings of the 29th Inter¬ 
national Conference on Machine Learning (ICML), 
pages 1359-1366, 2012. 

O. Cappe, E. Moulines, and T. Ryden. Inference in 
Hidden Markov Models. Springer, 2005. 

J. Carpenter, P. Clifford, and P. Fearnhead. Improved 
particle filter for nonlinear problems. IEE Proceed¬ 
ings Radar, Sonar and Navigation, 146(1):2- 7, 1999. 

Y. Chen, M. Welling, and A. Smola. Super¬ 
samples from kernel herding. In Proceedings of the 
26th International Conference on Machine Learning 
(ICML), 2010. 

R. Douc, E. Moulines, and J. Olsson. Long-term sta¬ 
bility of sequential Monte Carlo methods under ver¬ 
ifiable conditions. Annals of Applied Probability , 24 
(5):1767-1802, 2014. 

A. Doucet and A. Johansen. A tutorial on particle 
filtering and smoothing: Fifteen years later. In 
D. Crisan and B. Rozovsky, editors, The Oxford 
Handbook of Nonlinear Filtering. Oxford University 
Press, 2011. 

A. Doucet, S. J. Godsill, and C. Andrieu. On sequen¬ 
tial Monte Carlo sampling methods for Bayesian 
filtering. Statistics and Computing, 10(3):197-208, 
2000 . 

J. C. Dunn. Convergence rates for conditional gra¬ 
dient sequences generated by implicit step length 
rules. SIAM Journal on Control and Optimization, 
18:473-487, 1980. 

P. Fearnhead. Using random quasi-Monte-Carlo 
within particle filters, with application to financial 
time series. Journal of Computational and Graphical 
Statistics, 14(4):751-769, 2005. 

M. Frank and P. Wolfe. An algorithm for quadratic 
programming. Naval Research Logistics Quarterly, 
3:95-110, 1956. 


M. Gerber and N. Chopin. Sequential quasi-Monte 
Carlo. arXiv preprint arXiv:l)02.)039v5, 2014. 

N. J. Gordon, D. J. Salmond, and A. F. M. 
Smith. Novel approach to nonlinear/non-Gaussian 
Bayesian state estimation. Radar and Signal Pro¬ 
cessing, IEE Proceedings F, 140(2): 107-113, Apr. 
1993. 

A. Gretton, K. M. Borgwardt, M. J. Rasch, 
B. Scholkopf, and A. Smola. A kernel two-sample 
test. The Journal of Machine Learning Research, 
13:723-773, 2012. 

F. Huszar and D. Duvenaud. Optimally-weighted 
herding is Bayesian quadrature. In Proceedings of 
the 28tli Conference on Uncertainty in Artificial In¬ 
telligence (UAI), pages 377-385, 2012. 

M. Jaggi. Revisiting Frank-Wolfe: Projection-free 
sparse convex optimization. In Proceedings of the 
30th International Conference on Machine Learning 
(ICML), 2013. 

D. Ormoneit, C. Lemieux, and D. J. Fleet. Lattice 
particle filters. In Proceedings of the 17th Confer¬ 
ence on Uncertainty in Artificial Intelligence (UAI), 
pages 395-402, 2001. 

V. Philomin, R. Duraiswami, and L. Davis. Quasi¬ 
random sampling for condensation. In Proceedings 
of the 6th European Conference on Computer Vision 
(ECCV), 2000. 

M. K. Pitt and N. Shephard. Filtering via simulation: 
Auxiliary particle filters. Journal of the American 
Statistical Association, 94(446):590-599, 1999. 

B. Ristic, S. Arulampalam, and N. Gordon. Beyond 
the Kalman filter: particle filters for tracking appli¬ 
cations. Arteclr House, London, UK, 2004. 

A. Smola, A. Gretton, L. Song, and B. Scholkopf. A 
Hilbert space embedding for distributions. In Al¬ 
gorithmic Learning Theory, pages 13-31. Springer, 
2007. 

B. K. Sriperumbudur, A. Gretton, K. Fukumizu, 
B. Scholkopf, and G. R. Lanckriet. Hilbert space em¬ 
beddings and metrics on probability measures. The 
Journal of Machine Learning Research, 99:1517- 
1561, 2010. 

D. Tornqvist, T. B. Schon, R. Karlsson, and 
F. Gustafsson. Particle filter SLAM with high di¬ 
mensional vehicle model. Journal of Intelligent and 
Robotic Systems, 55(4):249-266, 2009. 



Sequential Kernel Herding: Frank-Wolfe Optimization for Particle Filtering 


Supplementary material 


A Extension for Rao-Blackwellization 

A common strategy for improving the efficiency of the PF is to make use of Rao-Blackwellization -this idea 
can be used also with SKH. Rao-Blackwellization, here, refers to analytically marginalizing some conditionally 
tractable component of the state vector and thereby reducing the dimensionality of the space on which the PF 
operates. Assume that the state of the system is comprised of two components Xt and Zt, where the filtering 
density for Zt is tractable conditionally on the history of x\-t- The typical case is that of a conditionally linear 
Gaussian system, in which case the aforementioned conditional filtering density p(zt\x\-t 1 yi-.t) is Gaussian and 
computable using a Kalman filter (conditionally on x \ ;t ). The Rao-Blackwellized PF (RBPF) exploits this 
property by factorizing: 

N 

p(z t ,xi-.t\yi:t) = p(z t \x 1 ,t,yi:t)p(x 1:t \y 1 :t) « 'Vw t ( 0 A/'(z t |S t (x^. ) t ),E t (a;^ ) t ))(5 (i)(a:i :t ), (5) 

ZJ X l:t 

i— 1 

where the conditional mean Zt{x\-t) := yi-t] and covariance matrix Et(xi,*) := ’V(zt\xi : t,yi-.t) can be 

computed (for a fixed trajectory oq.f) using a Kalman filter. The mixture approximation follows by plugging in a 
particle approximation of p(x\ : t\y\-.t) computed using a standard PF. Hence, for a conditionally linear Gaussian 
model, the RBPF takes the form of a Mixture Kalman filter; see Chen and Liu (2000). Analogously to a 
standard PF, the SKH procedure allows us to to compute an empirical point-mass approximation of p(xi : t\yv.t) 
by keeping track of the complete history of the state X\-t- Consequently, by (5) it is straightforward to employ 
Rao-Blackwellization also for SKH; we use this approach in the numerical example in Section 4.3. 

B Rates for SKH when using random search points 

In this section, we show that we can get guarantees on the MMD error of the FW-Quad procedure when 
approximately finding the FW vertex in step 3 of Algorithm 1 using exhaustive search through M random 
samples from p. This means that despite not solving step 3 exactly, the SKH procedure with M random search 
points (under assumptions of Theorem 2) is still consistent as long as M grows to infinity. 

The main idea is that the rates of convergence for the Frank-Wolfe optimization procedure still holds when the 
linear subproblem (step 3) is solved within accuracy of <5. More specifically, if we guarantee that the FW vertex 
g k . i-i that we use satisfy ( J'(gk), fffc+i) < mi rt g £Mi(J'(g k ),g) + <5 during the algorithm, then the standard 0(l/k) 
rate of convergence for FW carries through but within S of the optimal objective (i.e. up to J(g*) +5). A simple 
modification of the argument by Jaggi (2013) (who used a shrinking 6 during the FW algorithm) can show this 
for the step-size of 7 k = i we give the proofs for the step-size of 7 *, = as well as the potential faster rate 
0(l/fc 2 ) for the MMD objective in Appendix G. 

Let Xm C X be the set of M search points, and pm be the empirical distribution for the M samples from p. 
Let 5m ■= || p(pm) — p(p)\\n which can be made small by increasing M. Consider the iteration k in FW-Quad 
where we do exhaustive search on Xm in step 3. We thus have: 

(9k ~ lip, 4>(x (fc+1) )) = min (g k - p pi $(x)) = min (g k - p(p M ) + p(p M ) ~ p(p),$(x)) 

x£Xm 

< min (g k - p(p M ), ®(x)) + 5 M R M , 

where Rm := max Ig ^ M ||$(x)|| (Rm < R)- We can thus interpret step 3 as approximately solving (within 
5mRm) the linear subproblem for the Frank-Wolfe optimization of Jm(9) ■= \ \\g — p(pm)\\u over the marginal 
polytope of Xm- We thus get a rate of convergence to within 5mRm of min fl Jm(9) = 0. Finally, we have 


llffjv — p(p)\\u < || 9n — p(pm)\\h + 5m = \/2Jm(^v) + 5m < V^( e N + Rm5m) + 5m 







Simon Lacoste-Julien, Fredrik Lindsten, Francis Bach 






Figure 4: Error on the mean function (top row) and MMD error (bottom row) for the mixture of Gaussians 
experiment. The first column is for I\ = 20 and d = 2. The next two columns are for the same mixture of 
Gaussians in higher dimension d = 5 with K = 100 components, but running FW-Quad with a 2 = 1 (middle 
column) or a 2 = 100 (last column). We see that using a higher a 2 helps significantly in higher dimension. The 
dashed lines are linear fits with slopes reported next to the axes. 


where cn would be the error after N steps of a standard (non-approximate) Frank-Wolfe procedure (e.g. 0(1/N), 
though it could be 0(1/N 2 ) if p(pm) is in the strict interior of the marginal polytope of X M as we show in 
Appendix G). Finally, we know that E[<$m] < R/y/M, and we could also obtain a high probability bound for it 
as well using a concentration inequality with triangular arrays. This gives the guarantee for the MMD error of 
the SKH procedure with M random search points (with a term of 0(1/M 1 / 4 )). Even though the rate is slow in 
M. the approach is motivated for problems where the bottleneck is the evaluation of the observation probability 
(which is only evaluated N times per time step) whereas M can be taken to be very large. We also note that if 
T~L is finite dimensional and the kernel k is continuous, then an asymptotically faster rate of 0(1/y/M) can be 
shown (see Appendix G), though with a worse constant that makes the comparison for smaller M less clear. 

C Additional details on experiments 

C.l Mixture of Gaussians experiment 

The parameters for the mixture of Gaussians p(x) = 7r»A/"(a;|, £,) were randomly sampled as follows: 

• The means /V s are uniformly sampled on [—5, 5] d . 

• £, = a 2 1 where of is uniformly sampled on [0.1,4.1], 

• 7 Ti are obtained by normalizing independent uniform random variables. 

Figure 4 present additional results for the mixture of Gaussians experiments. From our experiments, we make 
the following observations: 

- FCFW always performs best (this was observed similarly in Bach et al. (2012) but for other pairs of distribution 
/ kernel). 
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As d increases, the difference between the methods tapers off for a fixed a 2 , but increasing a 2 gives better 
results for FW and FCFW than the others (see for example the last column of Figure 4). 

- The FW-LS results are identical to FW, and so we have excluded them from the plots for clarity. 

- The improvement of QMC over MC decreases as the number of mixture components I\ increase. FW and 
FCFW are not affected by K as much. 

QMC implementation. To generate quasi-random samples from the mixture of Gaussians, we generate a 
(d+l)-dimensional Sobol sequence using the Matlab qrandstream function. The last dimension is (naively) used 
to sample the mixture component by using the inverse transformation method for a discrete random variable. 
The first d components are then used to sample from the corresponding multivariate Gaussian by transforming 
d independent standard normals. We note that Gerber and Chopin (2014) argued on the importance of sorting 
the discrete mixture components according to their location before choosing them with the standard inverse 
transformation method (in our naive implementation, the order is arbitrary and arising from how the mixture of 
Gaussians was stored). They propose a method for this that they called Hilbert sort , for which they could prove 
nice low-discrepancy properties. This approach might reduce the sensitivity to K of QMC. The worse results of 
QMC for the UAV experiment in Figure 3 might be explained by our naive implementation. 

C.2 Synthetic data examples and additional results 

In this section we provide additional details and results for the synthetic data examples. The LGSS models 
and the modes of the JMLS are generated randomly using the function drss from the Matlab Control Systems 
Toolbox. The four models that were considered are given by: 


LGSS, d = 3 on the form 

x t +i =Ax t +v t , v t ~Af(0,I), 

yt + i = Cx t + e t , e t ~ W(0,0.1) 

with (A, C) being an observable pair. The system has poles in —0.2825 and —0.3669 ± 0.0379b 

LGSS, d = 15 on the form 


x t+1 =Ax t +v t , v t ~Af(0,I), 

y t+ i=Cx t + e t , e t ~ A/”(0, 0.1) 

with (A, C) being an observable pair. The system has poles in 0.2456 ± 0.6594*, 0.4833, 0.3329, 0.0882 ± 
0.2512*, -0.1485, -0.8045, -0.4848, -0.5252 ± 0.0368*, -0.6692 ± 0.0612i, -0.6604, and -0.6680. 

JMLS on the form 


P(r t+ i = l\r t = k) = n ki , 

x t+ i = A rt x t + F rt v t , v t ~ W( 0 , /), 

yt C rt Xt T G rt e t , et ~ A/"( 0 , 1 ), 

/07 o 3\ 

with II = ( ' ' 1, and the two system modes corresponding to observable systems with poles in —0.4429, 

0.0937, and —0.6576, 0.3109, respectively. 

Nonlinear benchmark model as described in the main text. 


Additional results, for the different values of a 2 € {0.01, 0.1,1} are reported in Figures 5-8. 
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Figure 5: RMSE for JMLS, using cr 2 = 0.01, cr 2 = 0.1, and a 2 = 1 (left to right). 





Figure 6: RMSE for nonlinear benchmark model, using a 2 = 0.01, a 2 = 0.1, and cr 2 = 1 (left to right). 


C.2.1 Discussion of role of cr 2 for FCFW 

The results in Figure 6 for the nonlinear benchmark show an interesting behavior for FCFW when the kernel 
bandwidth a 2 is increasing. In particular, for a 2 = 1 (rightmost plot), SKH-FCFW obtains the lowest error for 
all methods (and other a’s) at N = 20, but its error stays constant when increasing the number of particles 
while the other methods see their error decreasing. This phenomenon needs to be carefully studied further. Our 
current hypothesis is that when cr 2 is large, FCFW is too effective at myopically optimizing the MMD error for 
the mixture of Gaussians p t and yields a too small effective sample size (it sets many weights of particles to 
zero), thus hurting the particle filtering error. When d, is small or when cr 2 is large, the Gaussian kernel matrix 
becomes rank-deficient due to numerical precision; we thus have numerically a finite dimensional T~L. In the case 
of the Id nonlinear model, FCFW sometimes could optimize the MMD error within numerical precision (its 
square of the order of le-16) within 30 particles. FW-Quad would thus output only 30 particles even though we 
asked it to produce N > 30. This explains why increasing N did not translate in a reduction of filtering errors 
for SKH-FCFW with cr 2 = 1: the effective number of particles stayed much less than N. 

SKH-FW did not seem to suffer from this problem. This might partly explain why we were able to use the much 
bigger a 2 = 10 for the UAV experiment with good results (Figure 3) whereas we used a 2 = 0.1 for SKH-FCFW. 
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Figure 7: RMSE (top row) and MMD (bottom row) for LGSS (d = 3), using a 2 = 0.01, a 2 = 0.1, and a 2 = 1 
(left to right). Note that the MMD definition depends on a 2 . This is why the MMD curves for PF and QMC 
are also changing with a 2 (but their RMSE ones are not). 
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Figure 8: RMSE (top row) and MMD (bottom row) for LGSS (d = 15), using a 2 = 0.01, a 2 = 0.1, and a 2 = 1 
(left to right). 
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D Proof sketch for Theorem 1 


Proof sketch. We assume that the function f t : (x t +i,x t ) p(y t \x t )p(x t +i\xt) is in the tensor product 
T t ® Ht, with T t defined as in the statement of the theorem. We consider the nuclear norm (Jameson, 1987): 

= inf E |[q^» liftll~Ht 

over all possible decompositions {oi, AliSi of ft such that, for all Xt,Xt+i 

ft(x t +i,x t ) = ^2ai(x t+1 )/3i(x t ). 

i 

In the following, let be such a decomposition for f t . We have 

(F t u)(x t+ i) = J p(x t+1 \x t )p(yt\x t ) dv(xt) £ R 

p(F t v) = j'(F t v)(xt+i)<b{xt + i)dx t+ i e T-Lt+i- 
Now we have that ||/i(i 7 i^)||^ t+1 = sup^y^ ^ =1 \ (h, p,{Ftv))\, so we consider for some h € T-Lt+i- 

{h,n{F t u)) = J(F t i/)(xt+i)h(xt+i)dx t+ i (by linearity and reproducing property) 

= J (y ft(x t +i,x t )du(x t )^h(xt +1 )dx t+1 
= J J ft(xt+i,xt)h(x t +i)dx t+ idi/(x t ) (Fubini’s theorem) 

= J J ^ E ajjxt+i )/3j{x t )] h(x t+1 )dx t+1 du(x t ) 

= E yy Pi{ x t)dv(x t ) ^ (y oti{x t+ i)h{x t+ {)dxt + \^J (Fubini’s theorem). 

By (3), we have that |E^[ySi]| < ||/3 i ||^J / u(i/)|| Wt . Thus we have: 


\\y(F t v)\\ nt+1 = sup \(h,p(F t v))\ = sup 


E® 1 ^ (Ja i{ x t+1 Mxt +1 )dx t+1 


< V \ E APi\\ ( Sup [ 

4 " ---' V INIw, i , =1 J 


ai(x t +i)h(x t+1 )dx t+ i 


<m\\u t \Mv)\\n t 




< 


EiNiEiftik ikkk. 


This inequality was valid for any expansion for f tl and thus we can take the infimum of the upper 

bound over all possible expansions to get: 


\\p{F t v)\\H t+1 < WftWrt&HtWvWWut 


as we wanted to prove. 
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E Special case for the Gaussian kernel 


In this section, we explore what form || • ||jf takes for the Gaussian kernel. We then show that ||/t ||.F<g>'H is finite 
for a simple one-dimensional linear Gaussian SSM as long as a is small enough (and thus TL is big enough, as 
the size of 7L increases when er decreases for the Gaussian kernel). 

For the Gaussian kernel n(x,y) = exp(—||x — y||§/2cr 2 ) =: q(x — y), its Fourier transform is 


and 


Moreover, 


q(u) = J e"“ T “g(x) = {2Tr) d t 2 a d e-^ u ^> 

II, 1,2 1 /'IM w )l 2 , 


(a,h) L 2 = —(a,h) L 2 = 


(2n) a 


(27r) d J g(w) 1 / 


^2$(w) 1/2 a(w)dw. 


By applying Cauchy-Schwartz on L 2 on the RHS, this leads to 


|(a, h) L 2 | < \\h\\n 


1 


(27 


\a(oj)\ 2 q(uj)du] 


1/2 


Thus, we can take the function space: 

IH& = j^yij |4( U )I 2 4(^)^ = ^ 

This allows for quite peaky distributions for the dynamics, as Diracs are authorized (with constant Fourier 
transform). 

To compute an upper bound on we simply need to find a decomposition of p(yt\xt)p(xt+i\xt) as a sum 

of terms ai(xt+i)(3i(xt) and bound the appropriate norms of at and /3,;. 

We do this for a special case in the following section. 


E.l Bound for one-dimensional Gaussian distribution 

We assume that x t +i € and y t € R m , and that 

p{x t+ i\x t ) = 


1 


p(yt\x t ) = 


(x/ 27rr) d 
1 

( \[2/KV) r1 


e ~ ^rz\\ x t+i-Ax t \\ 


e -^l\\vt-Bx t \\l 


We only do the proof for d = m = 1 and yt = 0 (constant observations) to make the proof simpler. We conjecture 
that similar results hold more generally. We use the Mchler formula for w such that = y? (Abramowitz 

and Stegun, 2012), where H n is the n th Hermite polynomial ( H n {x ) := (— l) n e~ x ^yre _:r ): 

oo 

^2xyw / (1-w 2 ) = +y 2 )w 2 /(1—w 2 ) ( w /2) n H n {x)HJy). 

^ n\ 

n—0 


Thus 


p(x t+ i\x t )p{yt\xt) 


1 c -^2Xt + ,-^r A2 Xt e (x? + l+A 2 X?)w 2 /(l-W 2 ) 1 C~^ B2x2 


(- \[2kt ) 


{\[2j:v) 


_ 1 

\/l - w 2 —(w/2) n H n (x t +i)H n (Ax t ) 

Z - J 71.\ 


n—0 


•s/i — 


(x/27tt) (V2ttv) 


oo - 

V —Aw/2) n a n {xt+i)t3 n {x t ) 

z ' n! 

n—0 
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with, using - 5^2 + = yb# = 

®-n (^t+ 1 ) 

/3n{Xt) 


e -x* t+1 w/(l+w) Hn{xt+i) 
e -AV t w/(l+w) e -^B^ tHn ( Axt y 


We thus now need to compute the norms of a n and /3 n , by first computing the Fourier transform. We use the 
representation: 

Hn( - X) = 2i7r/ e ‘ ^ tt+i’ 

integrating over a contour around the origin, which leads to: 


x (w) = f e- iwx H n (x)e~ x2w /^ +w Ux 

J R 


2m 


„—iujx n ' i c -t 2 +2xt c ~x 2 w/(l+w) ai 


t n+1 


— ie~ t2 ( [ e-^e+^e-^/^+^dx)^. 
2 in J \ .L J *" +1 


We may now use 


e~ ax +bx dx=J-e b /4a 
a 


to get, using —1 + 4(1 +w)/4w = 1 /w, 


»(<*>) = 


2z7T 




n\ 


_ exp , _Ml±JdV/dl±z) I^/~ exp ( z^H±A )A L 

2 z7r \ 4u> / V w J V w J t n+1 


nl 

2 ^ eXP 


4u> 


u 2 (l +w)\ /tt ( 1 + w) w _ n/2 . n f e _ P ^ ( wi( 1 + « 0 \ dt 


J t n+1 


using the change of variable t = iy/wt. This leads to 


+ /»(l + »)..._„ /2Cexp 

w _ " / 2 C( 2 "n!v / 7 r ) 1/2 


|d n (w)| < exp ^ — 

< exp f — 


4u; 

w 2 (l — tu 2 ) \ / 7r(l + ru) _ r 


8 w 

using H n {x) < C exp(a; 2 /2)(2 rl n!v / 7r) 1 ^ 2 , with <7 = tt” 1 / 4 , and -±±f + ^( - 1 + = 

We thus have 


l-w z 

8w 


x n\\F 


V2tt 


J \a n (oj)\ 2 e U ’ 2a2/2 du} 


< „-» 2 »„!^C 2 di±^) f exp ( - “ a(1 ~ U,2) l 

w V27T J \ 4u> 

= «,-” 2 ’‘n !^ ’ r(1 + " , W -^- 


= (w n 2"»i!) x C(w,cr) 


y/Zncr 2 + (1 - w 2 )/2w 
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Moreover, 

AiM 


[ e- iux H n (Ax)e~ x ^ A2w/il+w)+B2/2v2) dj 

Jr 


AL-> 

2nr J 


c -iuix J, „-P„+2 Axt dt ^_ x 2(a 2 w/(1+w) + B 2 /2v 2 ) c i x 


t n+1 


e ~iuix e +2Axt £ -x 2 (A 2 w/(l+w) + B 2 /2v 2 ) ^ 


2 , \ dt 

X 


)! 


le- 1 


2i.Tt J 
n\ 

= ^T ex P 


A 2 w/{ 1 + w) + B 2 / 2v 
-J 1 


t n+l 

[2 At — icu] 2 


A{A 2 w/(l + w) + B 2 /2v 2 ) J J t n+1 


dt 


2m \4(A 2 w;/(l + w) + B 2 /2v 2 ) J \j A 2 w/(1 + w) + B 2 /2v 2 


7r 


x cp e 1 I exp 


■ exp 


4 A 2 t 2 — AAtiuj \ \ dt 

A(A 2 w/{\ + w) + B 2 /2v 2 ) J J ¥+ l 
, ,2 


2m \4(A 2 tu/(l + w) + B 2 /2v 2 ) J V A 2 w/( 1 + w) + B 2 /2v 


<j> exp (^t 


n\ 


■ exp 


A 2 — ( A 2 w /(1 + w) + B 2 / 2v 2 ) 
(. A 2 w/{\ + w) + B 2 /2v 2 ) 

,2 


exp 


—AAitu) \ \ dt 

4,(A 2 w/(\ + w) + B 2 /2v 2 ) J J t n+1 


—to 


7r 


2m " \4(A 2 w/(l + w) + B 2 /2v 2 ) J \ A 2 w/{\ + w) + B 2 /2v 2 

x l exp ft 2 A2 /( 1 + w )- B ' 2 / 2v2 ) ) ( exp ( - 4Ati “ 

x ® exp t . „ .. . „ . „ exp 


n! 

= 77— exp 


(A 2 w/(1 + w) + B 2 /2v 2 ) 

, ,2 


A(A 2 w/(l + w) + B 2 /2v 2 ) J J t n+1 


dt 


2m J \4(A 2 tu/(l + w) + B 2 /2v 2 ) J y A 2 w/(1 + w) + B 2 /2v 2 

f / 2 1 ^ B 2 {\ + w)/2v 2 A 2 ) \ / / —AAtiiv \\ dt 

J 6XP V w + B 2 {l+w)/2v 2 A 2 J { 6XP \4{A 2 w/(l+w) + B 2 /2v 2 ) J J ¥+^' 


We can now perform the change of variable t = ity = with w > w for B > 0. 

This leads to 


n\ 


Pn(u) = —exp 


—LU 


xw 


= exp 


2i7r \A(A 2 w/{1 + w) + B 2 /2v 2 ) J y A 2 w /(1 + w) + B 2 /2v 2 

AAty/dux \\ dt. 


n ! 2 ® exp(— t 2 ) ( exp 


A{A 2 w /(1 + w) + B 2 j 2v 2 ) J J t n+1 


xur n / 2 H 


A(A 2 w/(\ + w) + B 2 /2v 2 ) J y A 2 w/{ 1 + w) + B 2 /2v 2 
2 A\f&<jj 


\fi n (u)\ < exp 


4(^4 2 'u;/(l + w) + B 2 / 2v 2 ) 
2 


— u> 


^{A 2 w/{\ + w) + B 2 /2v 2 ) J Y A 2 w/( 1 + w) + B 2 /2v 2 


X^-n/2( 2 n n !v / 7r ) 1/ 2 C -exp ( - 


2 AtVtiuj 


_4(A 2 w/{l + w) + B 2 /2v 2 ) 
^ cst x ■u; _n ^ 2 (2"?r!v / 7r) 1 / 2 exp(—□(A, w, B)co 2 ) 


with 


□ (A, w, B) 


r 


4(t1 2 'u;/(1 + w) + B 2 /2v 2 ) 8 (A 2 w/( 1 + w ) + B 2 /2v 2 ) 2 
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We have D(A, w, 0) = ■ Thus, by continuity if B if small enough, D(A, w , B) > 0. Note that when we have 

B = 0 and A = 1, we recover previous results for a n . 

We thus have 

IM = 

< (2 n w~ n nl)C(A,B,w,a) 

as long as a 2 < 40(A, w , B). 

Thus, for a small enough, we have the norm less than a constant times 

OO 

w/w) n I 2 < 00 

n =0 

since w > w. This shows that ll^ll-HI/^nllw < oo and thus that C t is finite if the linear dependency 

parameter B and the kernel bandwidth a 2 are small enough. 

F Proof for Theorem 2 

Proof We recall here that ptixt) = p{xt\yi-.(t-i)) (we are in the marginalized setting). In the notation of the 
algorithm, we have pt+i = -r^rF t pt- Let qt = Pt%t -1 be the un-normalized marginalized predictive distribution 

(and similarly, q t = ptZ t - 1 ). We thus have pt+i = J -F t q t . We use the metric inequality (as well as the linearity 

At 

of the MMD in each of its argument as it is related to the RKHS norm; so a scalar multiplication of a distribution 
can be taken out of the MMD): 

MMD(p t+ i,ft + i) < MMD(pt+i, ^F t q t ) + MMD (^-F t q t: F t q t ) +^~ MMD (F t q t ,F t q t ) . 

z t z t z t z t v- v -/ 

s * -- y s * “v** ^ (HI) Initialization error 

(I) FW error := et+i (II) Normalization error 

The term (I) is the algorithmic Frank-Wolfe error it+i := MMD(pi + i,pj + i). The term (II) is the normalization 
error which can be bounded as follows: 

MMD (±F t q t , F t q t ) = \\-L»(F t q t )\\ n ± \Z t - Z t \ < ^MMD (q t ,q t ). 

Zt Zt A s -V- / Zv t 

s ---' (B) <||o t ||wMMD(gt,gt) 

(A) <R 

For inequality (A), we note that = pt+\ which is a normalized distribution on x t +i, this is why ||p,(p t+1 )||^ < 

At 

R as ||$(x)||-h < R Vx £ X by assumption. For inequality (B), we have that \Z t — Z t \ = |E gt [o t ] — E gt [o t ]| < 
||ot||-H MMD(gt,?t) by (3) under the assumption that o t £ R. 

Finally, the initialization error term (III) can be bounded by using Theorem 1 (with v = q t — q t ): 

MMD (F t q t , F t q t ) < C t MMD (q u q t ), 

where C t := ||/t||F®«- 

To control MMD(g t , q t ), we now only work on the un-normalized distributions: 

MMD(g t , q t ) < MMD(q t , Ft^qt-i) + MMD(F t -iq t -i, Ft-iqt-i 

" -V-' s -V- 

e t <Ct-i MMD( g t_i, g t_i) 

by repeating the arguments for smaller f’s and unrolling the recursion (and recall that Ol=t(‘) = 1 by convention). 
Combining the three terms, we thus get: 

MMD(p (+1 ,p f+1 ) < e t+ i + {R\\o t \\n + C t ) £ ^ ^ ) ■ 

u—1 \k—u / 


) - e “ ( tt Ck 


u—l 


V/c—li 


( 6 ) 
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Finally, we transform back the errors in the algorithmic quantities tt that the FW algorithm measures: 


e t+ i = MMD(p t+ i, 


W t 


MMD ( „ q t+u 

Zt 


-If.J-) 

W t Z t _! 


yMMD {q t+1 ,F t q t ) = 4-e t +i. 

Zt z f 


And so we can rewrite: 



We expect Xu = JIfc=i wf ~ 1 as the errors on the normalization constants could hopefully go in both direction 
and thus cancel each other, though in the worst case it could also grow with u. Substituting back in (6), we get 
what we wanted to prove: 


MMD(p t+ i,p t+ i) < e t+ i + 


(R\\ot\\u+C t ) 

W t 


t 

^2 Xu 

U—l 


n 

\k — u 


C4 

Wk 


(7) 


Remark 1 (Bound for Z t ). For parameter estimation in a HMM, one would also be interested in the quality of 
approximation for Z t . We note that inequality (B) also gives us a bound on the relative error of our estimate Z t 
for the normalization constant: 


Zt - 


\\°t\\n y' 

w t ^ x ' 




ft-1 

n 

\k—u 


c k 

w k 


Remark 2 (Bound for joint predictive distribution pf). To be more precise, we could have used the notation 
jit and MMD 4 to be explicit that the RKHS considered was for functions of aFor example Theorem 1 
really says that MMD t+ i {F t qt,F t q t ) < C t MMD t (g t , q t ). But since ji t = T~L (in the isomorphism sense) for 
all t , we did not have to worry about this. On the other hand, as jit contains functions of Xt only, we have 
that p(p t ) is the same whether p t is the marginalized or the joint predictive distributions p/ (as for the joint, 
the expectation in the mean map definition will marginalize out the variables Xi-^-i) as they do not appear 
in jit)- This means that if we consider the joint forward transformation F t J on a joint measures v J on aq :t : 
(F/v)(xt+i,Xi: t ) := p(x t+1 \x t )p{yt\x t )v J (xx-.t), i.e. pf +1 = WpJ p J ( n0 w i n the joint sense), then we have 
p,(F t p t ) = p{FtPt), and thus Theorem 1 also holds for the joint predictive distribution p^. 

Remark 3 (Bound without Xu)- The disadvantage of the bound (7) is the presence of the quantity Xu for which 
we did not provide an explicit upper bound (though we would expect it to be close to 1). To get an explicit 
upper bound for the error, we can repeat a similar argument but always working with the normalized quantities: 

MMD(p t+1 ,p t+1 ) < MMD(p t+1 , 2-F t pt) + MMD(-J-F t p t , ^—F t p t ) +-j— MMD( F t p t , F t p t ) . 

Wt Wt Wt Wt " -*-- 

v -v-- y s -v-- ' (HI) Initialization error 

(I) FW error := €t_|_i (II) Normalization error 

The term (II) is the normalization error which can bounded similarly as before as: 

MMD (^-F t p t , ±-F t p t ) = \\±rn(FtPt)\\n \W t - W t \ < ^MMD (p t ,p t ). 

Wt W t Wt W t "- v -' w t 

s --" (B) <||o t || H MMD(p t ,p t ) 

(A) <R 

Similarly as before, we also have for (III) that by Theorem 1 (with v = p t — p t ) that MMD (F t pt,F t p t ) < 
Ct MMD(p t ,p t ). Combining the three terms, we get: 

MMD (pt+i,Pt+i) < et+i + fl||Qtl ^ + Ct MMD(p t ,p t ) < £ ( IJ & 

u= 1 \k—u 


( 8 ) 















Simon Lacoste-Julien, Fredrik Lindsten, Francis Bach 


where p t := R \\ 0t \\u+ C t ; by unrolling the recursion for smaller t’s. 

The problem with bound (8) is that p > 1 usually due to the extra term i?||ot|| in its definition, which is why we 
preferred the tighter form (7). 

Remark 4 (Removing the Ot £ R condition in Theorem 2). We note that the condition ot £ R is not really 
necessary in Theorem 2. If ot (/ R, we can instead re-derive a similar argument as above but using Z t = Ef t9l [1], 
where 1 is the constant unit function (here on xt+ 1 ). We then have \Z t — Z t \ < ||1||h' MMD 1 (F t qt, F t qt), where 
R' is an augmented RKHS to ensure that it contains the constant function 1. We define R' = R if 1 £ R already. 
If 1 R, we define R' to be the Hilbert sum of the RKHS R and the one generated by the constant kernel 1 (and 
thus R' is a RKHS with kernel k' = 1 + k where k is the original kernel for R ; see Berlinet and Thomas-Agnan 
(2004, Thm. 5)). We can show that running the Frank-Wolfe algorithm using the kernel k! yields exactly the 
same objective values and updates, and thus we can use the space R' to analyze its behavior: Theorem 1 and 
an analog of Theorem 2 then hold, but with all norms defined with respect to R' instead. 

G Faster rates for FW with approximate vertex search for the MMD objective 

In this section, we provide the proofs for the rate of convergence for FW on the MMD objective J(g) := 
||| g~ p{jp)\\\i when an approximate vertex search is used (as mentioned in Appendix B) by extending the proofs 
from Chen et al. (2010). We consider the step-size 7 *, = We note that the standard step-size for Frank- 

Wolfe optimization to get a 0(1/k) rate is 7 *, = ^ 77 - G The best rate known for general objectives when using 
FW with 7 fc = 7 A 7 is actually 0(\og(k)/k) (Freund and Grigas, 2013, Bound 3.2). We make use of the specific 
form of the MMD objective here to prove the 0(1/k) rate, as well as the faster 0(1/k 2 ) rate under additional 
assumptions. For the rest of this section, we use || • || to mean || • ||%. 

Theorem G.l (Rates for FW-Quad with approximate vertex search). Consider the FW-Quad Algorithm 1 
where an approximate vertex search is used: (g k — p p ,gk+ 1 ) < min 5 eAt(<7fc — /r p ,g) + <5, where g k +i '■= < f’(a:fe 4 -i) 
and S > 0. Suppose that p p lies in the strict interior of M. with a radius r > 0, i.e. a ball of radius r centered 
at p p lies withm A4. Recall that max gS ^( ||g|| < R. Then we have the faster rate 0(l/k 2 ) for the objective J : 

1 2 R 2 6 

\\9k-th>\\ < + -■ (9) 

If r = 0 (note that p p £ M.), then we can still get a standard 0(1/k) FW rate: 

\\gk ~ Tp\\ 2 ^ ( 10 ) 

Proof If a ball of a radius r centered at p p lies within A4, then we have that: 

mi ,n (9k ~ H P ,9~ H P ) < ~r\\g k - P P \\- 

g&M 

So the approximate vertex search yields g k + 1 with the property: 

(9k ~ 9 P ,gk+ 1 - 9-p) < -r\\g k - P p \\ + 5. (11) 

By using the FW update g k+ 1 = 7 fc 3 fc+i + (1 - 7 k)g k with the 7 fc = 773 ; step-size, we get: 

1 k k | 1 

Il5fc+i ^ T P II 2 = Il7fc<?fc+i + (1 - 7 k)g k - Ppf = || ^fc+i + jV^9k - 

= (h _|_ x) 2 IICfffc-l- 1 — Tp) + ^’(<?fc — 9p)\\ ■ 

5 We note that the rate extends to the line-search step-size as well as the improvement at each iteration can only be 
better in this case considering the proof technique that we use. 

6 We also tried the 7 j, = step-size in the mixture of Gaussians experiment of Section 2.3, but it gave similar results 
as the 7 k = fcijrj; step-size. 
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Thus if we let v k ’■= k(g k — n P ), then we get: 

IK'fc+ill 2 = ||(flfc+i ~ V P ) + v k \\ 2 

lll/fc+i M p II + ll^fcll T 2(u£, 9k+1 Mp) 

< 4 R 2 + \\v k \\ 2 + 2(kS - r|M|) = ||ffc|| 2 + 2||«*|| 


( 12 ) 


The last inequality used the crucial strict interior assumption that yielded (11). Now let C k := 4 (2 R 2 + kS). 
Note that C k +i > C k . We will now proceed to show by induction that ||t’fc|| < C k for k > 1. Note that the 
bracket in (12) is negative if and only if ||wfc|| > C k (i.e. ||ufc+i|| < ||ufc|| in this case), giving the inspiration for 
the C k threshold. 

First, we have that 

\\v 1 \\ = \\g 1 - f i p \\<2R<2R-<C 1 , 

r 

by using the fact that — > 1 since a ball of radius r fitting in M. implies that the maximum norm R of elements 
in A4 is at least r. 

Now suppose that ||vfc|| < C k , i.e. that ||i>fc|| = aC k for some a £ [0,1]. Then (12) becomes: 

Ik+ill 2 < a 2 Cl + 2aC k [ r ^~ - r] = a 2 Cl + 2C k r[l - a}. 

aC k 


The RHS is a convex function of a, and so it is maximized at the boundary of its domain. For a = 0, we get 
||^fc+i|| 2 < 2 C k r = 4 R 2 + 2 kS. For a = 1, we get ||i>fc + i|| 2 < C 2 . And thus in general, supposing a £ [0,1], we 
get that ||t) fc+ i|| 2 < max{2C k r, Cl} = C% as: 


Cl = AR 2 


R 


- + Ak6 - 


R 


+ k 2 [- 
. v 


> 2 C k r 


AR 2 + 2 kS 


using y > 1. This completes the induction step as this means that ||wfc+i|| < C k < C k + 1 - 
Thus we conclude that ||ufc|| < C k for all k > 1, i.e. 


II 9k 


Mp II A 



6 

r 


This shows the faster O(lfk) rate (9). If we do not have g. p in the strict interior of A4, i.e. r = 0, then we can 
unroll the inequality (12) to get: 


K+if < 4i? 2 + 2 kS + ||u fc | 


< ( 4i?2 + 2lS ) 


i=i 


llfoll^ 

<4R 2 


<{k + l)AR 2 + 


,2 , k(k+ 1) 


26. 


This thus shows (10): 

II 9k -Mp|| 2 < ^AR 2 + 5. 


This translates to a slower 0(\/Vk) rate on \\g k — g p \\ (with \/S precision), but note that at least it does not 
have the log(fc) factor from the rate by Freund and Grigas (2013). ■ 
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Consequence for SKH with random search points. Going back over the argument from Appendix B, we 
said that using M random search points in FW-Quad was similar to approximately solving (within 5mRm) the 
linear subproblem for the Frank-Wolfe optimization of Jm(s) := ^||<7 — M Pm)\\u over the marginal polytope of 
Xm- We note that the marginal polytope of Xm is at most M-dimensional, and thus, by using a similar argument 
as in Proposition 1 of Bach et al. (2012), we could show that it contains a ball of radius > 0 centered at 
m(pm )- 7 From (9) in Theorem G.l, we can thus conclude that: 

II 9k - h(pm )|| < —— ( y2 R m + ■ 

I'm \k J 

This seems to give a faster rate, but the problem is that might shrink at an exponential rate with M if H 
is infinite dimensional. Thus even though 5m is 0(1/\/M), the dependence of 5m 7(7 might be worse than the 
previously quoted rate of 0 ( 1 /M 1//4 ); the latter is thus the general worst-case. 

On the other hand, under the additional assumption that H is finite dimensional and that there is a ball of radius 
r > 0 centered around g p in A4 (by using Proposition 1 of Bach et al. (2012) for example), then for a sufficiently 
large M, we will have tm close to r. Thus for large M, we will have \\gk — P-(pm )|| < 7 (77 + 5m), and thus 

\\ 9 n - 9 p\\ = O (i + • This gives an asymptotically faster rate than the O (f? + 777771)) 

rate given in Appendix B arising from (10), but the constant is worse by a factor of 7 . 
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